Abstract


This  project  applies  expert  system  technology  to  the  task  of  searching  online  collections 
of  documents.  We  are  developing  an  intelligent  search  intermediary  to  help  end-users 
locate  relevant  passages  in  large  full-text  databases.  Our  expert  system  will  automatically 
reformulate  contextual  Boolean  queries  to  improve  search  results  and  will  present  retrieved 
passages in decreasing order of relevance. It differs from other intelligent database functions
in two ways: it works with semantically unprocessed text, and its expert system contains
a knowledge base of search strategies independent of any particular content domain.

The  goals  for  our  current  project  are  to  demonstrate  the  feasibility  of  the  approach 
and  to  evaluate  the  effectiveness  of  the  system  through  a  controlled  experiment.  While 
the  work  we  report  here  has  limited  objectives,  the  system  and  technique  are  general  and 
can be extended to large, real-world databases.

1 Introduction


1.1  Motivation 

As  the  cost  of  computers  decreases  and  their  capabilities  increase,  more  and  more 
professionals  will  use  personal  workstations  to  aid  them  in  their  work.  In  most  instances, 
these  powerful  personal  machines  will  be  linked  by  networks  to  large  mass  storage  devices, 
such  as  laser  disks.  Consequently,  many  knowledge  workers  have,  or  will  soon  have,  access 
to  large  full-text  databases  in  their  fields.  Without  new  tools  to  help  them  manage  and 
use  the  large  number  of  texts  that  will  be  available  on-line  to  them,  these  professionals 
will  soon  be  buried  in  information. 

Searching  existing  full-text  databases  currently  presents  two  basic  problems: 

1) the user must know the technical details of the retrieval system to use it effectively

2)  the  process  is  laborious 

To avoid the technical details of the retrieval system, many users present their information
needs to a trained search intermediary who then searches the online databases for
them.  This  approach  creates  several  problems.  Because  the  user  is  not  involved  in  the 
search  process,  he  receives  passages  that  the  searcher  believes  are  relevant,  based  on  the 
searcher’s  understanding  of  the  user’s  needs.  Since  the  user  often  has  only  a  vague  idea  in 
advance  of  topics  and  terms  on  which  to  search,  several  searches  are  often  required,  each 
based  on  the  results  of  the  preceding  search. 

If  the  user  does  his  own  searching,  the  process  is  likely  to  be  laborious  and  time 
consuming,  particularly  if  the  user  is  a  novice  or  infrequent  searcher.  These  individuals 
may  use  inappropriate  search  terms  and  require  many  iterations  to  improve  their  queries. 
They may become frustrated, unable to find relevant information they are sure the database
contains.  Or  they  may  be  overwhelmed  by  a  flood  of  marginally  relevant  passages. 

We  are  attempting  to  address  both  problems  by  providing  an  online  search  assistant  to 
handle  the  technical  details  and  to  reformulate  search  queries  automatically.  This  approach 
offers  the  best  of  both  worlds:  the  user  is  actively  involved  in  the  search  process,  but  he 
will  need  less  training  to  achieve  satisfactory  results. 

1.2  Background 

Research  that  relates  to  our  project  can  be  found  in  several  areas.  This  includes  work 
in  user-interface  design,  information  retrieval  software,  and  artificial  intelligence. 

As  the  demand  for  direct  access  to  existing  online  information  retrieval  systems  has 
grown, so has interest in providing friendlier interfaces. Marcus [Marcus, 1981] and
Meadow [Meadow, Hewett, & Aversa, 1982] describe research prototypes based on conventional
programming techniques which make existing bibliographic databases easier to
search.  These  projects  have  focused  on  providing  menu  systems  to  guide  novice  users. 
The  menus  provide  information  to  the  user  about  choosing  the  correct  database,  selecting 
search  terms,  and  connecting  to  a  remote  database.  Although  many  technical  details  are 
hidden  from  the  user  and  information  is  available  online  to  prompt  him,  the  interaction  is 
still  laborious  and  these  interfaces  have  not  been  extended  to  full-text  databases. 

Many  projects  have  looked  at  the  possibility  of  allowing  users  to  query  databases  in 
natural  language,  removing  the  need  for  them  to  form  Boolean  queries.  Euzenat  and 
his team [Euzenat, Normier, Ogonowski, & Zarri, 1985] have produced a prototype of a
transportable natural language interface to database management systems. This interface
transforms the user’s query into one that can be answered by the relations defined in the
database. Defude [Defude, 1984] has proposed a natural language interface to a bibliographic
retrieval system incorporating an expert system. Both systems help with query
formation, but do not assist the user in refining that query to improve search results.
Again,  these  systems  have  not  been  extended  to  full-text  databases. 

The  artificial  intelligence  research  that  applies  most  directly  to  information  retrieval 
is that on question answering systems. These systems build internal knowledge structures
from documents in a given area and then synthesize answers to questions based on
that  structure.  One  example  is  “Researcher”,  under  development  by  Michael  Lebowitz 
[Lebowitz,  1985].  However,  building  a  knowledge  structure  from  natural  language  text  is 
a  slow  and  error-prone  process  that  is  currently  not  feasible  for  large,  dynamic  collections 
of  documents.  Even  if  the  knowledge  structures  could  be  built  and  queried  effectively  for 
large  document  collections,  most  users  will  probably  want  to  see  the  actual  text  of  the 
original  documents,  not  just  a  synthesized  answer  to  their  query.  We  agree  with  Karen 
Sparck  Jones  that  “the  language  of  documents  is  part  of  their  information  content”  [Sparck 
Jones,  1983]. 

A  different  use  of  AI  techniques  can  be  seen  in  several  recent  projects  in  Library 
Science.  (Many  of  these  projects  are  surveyed  in  [Jones,  1984]  and  [Smith,  1980].)  Most 
interesting  is  the  research  on  expert  systems  that  help  end-users  to  do  their  own  searching. 
Pollitt  [Pollitt,  1987]  developed  a  prototype  system  which  aids  searches  of  cancer  literature. 
Walker and Janes [Walker & Janes, 1984] have been working for several years on a system
to help search part of the Chemical Abstracts database. More recently, the PLEXUS
referral expert system [Vickery & Brooks, 1987] has combined a natural language interface
with a knowledge base that contains both domain knowledge on gardening and strategies
for searching that domain. Each of these systems represents an important contribution
to the future of information retrieval; however, all are tailored to searching bibliographic
databases in only one specific content area.

What is badly needed is a system that can work with different full-text databases, that
can be used by the end user, and that can moderate the output so that the user is not
inundated. The system we are developing represents one step toward these goals.

1.3  Functional  Overview 

The  expert  system  we  are  developing  will  serve  as  the  front-end  to  a  full-text  database. 
Our  goal  is  to  provide  many  of  the  benefits  of  a  search  intermediary  without  the  drawbacks. 
The  user  will  interact  with  the  expert  system  in  a  high-level  query  language.  The  expert 
system  will  deal  with  the  technical  details  of  the  textbase.  It  will  also  work  with  the  user 
to  refine  the  query  if  the  initial  search  is  unsatisfactory.  If  the  search  produces  too  many 
passages,  the  expert  system  will  reformulate  the  query  by  tightening  constraints.  If  it 
produces  too  few,  it  will  reformulate  the  query  by  loosening  constraints  and/or  expanding 
the  search  terms.  When  an  appropriate  number  of  passages  have  been  identified,  the  expert 
system will rank order them in terms of their probable interest to the user. Throughout
the  process,  the  user  remains  an  active  participant  in  the  search,  but  with  the  system 
assuming  responsibility  for  much  of  the  detail. 

2  System  Architecture 

The  system  we  are  developing  has  five  major  components: 

1)  MICROARRAS  which  serves  as  the  full-text  search  and  retrieval  engine 

2) a full-text database

3) a hierarchical thesaurus of words specific to the textbase’s domain

4)  an  expert  system  which  interprets  the  user’s  queries,  controls  the  search  process, 
analyses  the  retrieved  text,  and  ranks  the  search  results 

5)  a  user  interface  which  accepts  the  user’s  queries,  presents  requests  for  information 
from  the  expert  system,  and  displays  the  search  results.  It  is  not  a  major  thrust  of 
this  project,  and  is  not  discussed  further  in  this  paper. 

The system is being implemented on a Sun 3 workstation. MICROARRAS is written
in  the  C  language.  The  textual  database  for  our  current  demonstration  project  consists 
of  an  unpublished  manuscript  on  computer  architecture  written  by  F.  P.  Brooks,  Jr.,  and 
Gerard  Blaauw  [Brooks  &  Blaauw,  1987].  The  thesaurus  construction  and  access  routines 
are  also  written  in  C.  For  an  expert  system  shell,  we  are  using  OPS83. 


Figure 2.1 System Architecture

The  search  process  consists  of  a  dialogue  between  the  user  and  the  expert  system.  The 
user  enters  the  initial  contextual  Boolean  query  which  the  expert  system  translates  into 
a  request  for  information  from  MICROARRAS.  MICROARRAS  retrieves  text  passages 
from  the  full-text  database  and  informs  the  expert  system  of  the  number  of  passages  that 
satisfy  the  request.  The  expert  system  evaluates  the  search  results  and  decides  whether  or 
not  to  reformulate  the  query. 

To expand a search query, the expert system may use three different strategies, alone
or in combination. First, it can expand individual search terms to the set of
synonyms contained in the domain-specific thesaurus. Since the thesaurus is structured as
a tree, this process can be iterated several times to include ancestor as well as cousin sets.
Second, it can relax contextual constraints. MICROARRAS provides complete generality
in terms of segmental contexts: search expressions may contain contextual parameters
in terms of any number of words, sentences, paragraphs, etc. to either the right or left
of any term in the search expression. The expert system can therefore increase the number of
such units to generate more potential hits. Finally, it can change the Boolean operators,
making the query less restrictive.


To  restrict  a  search,  the  expert  system  uses  the  same  strategies  as  those  described 
above,  but  in  reverse.  That  is,  it  may  reduce  sets  of  search  terms  to  only  the  head  term 
listed  in  the  thesaurus,  contract  contexts,  and  replace  Boolean  operators. 
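
The contextual part of these strategies can be sketched as follows. This is only an illustration in Python, not the OPS83 rules of the actual system: the names ContextualQuery, relax_context and contract_context, the ladder of context units, and the max_span limit are all assumptions made for the example.

```python
from dataclasses import dataclass, replace

# Hypothetical ladder of context units, narrowest first.
CONTEXT_UNITS = ("sentence", "paragraph", "section")


@dataclass(frozen=True)
class ContextualQuery:
    terms: tuple            # search terms (words or category names)
    operator: str           # "and" or "or"
    unit: str = "sentence"  # context unit the terms must share
    span: int = 1           # number of such units


def relax_context(query, max_span=3):
    """Loosen the context: grow the span, then step up to the next unit."""
    if query.span < max_span:
        return replace(query, span=query.span + 1)
    position = CONTEXT_UNITS.index(query.unit)
    if position + 1 < len(CONTEXT_UNITS):
        return replace(query, unit=CONTEXT_UNITS[position + 1], span=1)
    return query  # already as loose as this sketch allows


def contract_context(query, max_span=3):
    """Tighten the context: the same ladder walked in reverse."""
    if query.span > 1:
        return replace(query, span=query.span - 1)
    position = CONTEXT_UNITS.index(query.unit)
    if position > 0:
        return replace(query, unit=CONTEXT_UNITS[position - 1], span=max_span)
    return query  # already as tight as this sketch allows


query = ContextualQuery(("stack", "overflow"), "and")
wider = relax_context(query)
```

Because the two functions are inverses along the same ladder, a restriction step simply undoes an expansion step, which mirrors the "same strategies, but in reverse" description above.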


Once an appropriate number of passages has been identified, the expert system attempts to
rank order them in terms of probable relevance. It does this by performing a rudimentary
content analysis on the passages retrieved by MICROARRAS and computing a relevance
index for each. The relevance index for each passage is a function of the number of search
terms actually found in that passage, the number of distinct types for each (for terms
that are sets), and the number of different thesaural categories represented. The retrieved
passages are then sorted by relevance index and presented to the user in order of probable
interest.


A major advantage of this architecture is the separation of strategic knowledge, contained
in the knowledge base for the expert system, from domain knowledge, contained in
the  thesaurus.  Once  the  search  strategy  rules  have  been  developed  and  tested  with  the 
existing  textbase,  the  expert  system  could  be  extended  to  other  content  domains  by  simply 
providing  a  suitable  thesaurus  for  the  new  textbase. 

3 MICROARRAS


3.1  Capabilities 

MICROARRAS is an advanced full-text retrieval and analysis system [Smith, Weiss,
& Ferguson, 1986]. The system provides immediate access to any passage in the textbase,
regardless of the length of the document. Users can browse through a document’s vocabulary
as well as its text. MICROARRAS also provides Boolean search on any word
or  set  of  words  in  the  text.  Contexts  for  searches  can  be  indicated  in  terms  of  words, 
sentences,  paragraphs,  etc.,  for  the  entire  search  expression  or  for  different  parts  of  it.  One 
particularly  important  feature  for  this  project  is  a  generalized  categorization  option  by 
which  one  may  define  sets  of  words  or  text  locations  as  well  as  recursive  categories  whose 
members  are,  themselves,  categories.  Any  command  that  accepts  a  word  as  a  parameter 
will  accept  a  category  name  instead.  Thus,  categories  can  be  used  in  search  expressions, 
making  MICROARRAS  particularly  well-suited  to  work  with  a  hierarchical  thesaurus. 
MICROARRAS  can  also  compute  and  report  various  frequency  of  occurrence  statistics  in 
the  form  of  distribution  vectors  over  a  text  or  set  of  texts. 
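
To make the idea of recursive categories concrete, here is a minimal Python sketch of flattening a category into its member words. It does not reflect MICROARRAS's internal representation, which the text does not describe; the dictionary layout, the function name expand_category, and the sample category names are assumptions for illustration.

```python
def expand_category(name, categories, _seen=None):
    """Return the set of plain words reachable from a word or category name."""
    seen = set() if _seen is None else _seen
    if name not in categories:
        return {name}      # a plain word expands to itself
    if name in seen:
        return set()       # already expanded; also guards against cycles
    seen.add(name)
    words = set()
    for member in categories[name]:
        words |= expand_category(member, categories, seen)
    return words


# Hypothetical categories: STACK contains words and another category.
categories = {
    "STACK_OPS": ["push", "pop", "top"],
    "STACK": ["stack", "lifo", "STACK_OPS"],
}
stack_words = expand_category("STACK", categories)
```

Since a category name can stand wherever a word can, a search on STACK in this sketch would match any of the five underlying words.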

To be inserted into MICROARRAS’ textbase, documents must first be inverted. However,
they require no semantic preprocessing. Once stored in the textbase, they can be
examined  individually  or  in  groups.  They  can  also  be  moved  from  one  textbase  to  another. 
Thus,  documents  can  be  processed  on  a  workstation  or  microcomputer,  uploaded  into  a 
textbase  on  a  mainframe  or  textbase  server,  searched  and  analyzed  there,  or  downloaded 
for  local  use  once  again. 

3.2  FLANGE 

FLANGE is a two-way command language that was developed as part of the MICROARRAS
system. It serves two major functions: it provides communication
between the user interface and the analytic engine that performs all search and
analysis operations, and it provides a formal specification for the system. It is written in
a BNF-like notation. Consequently, programs can easily construct command expressions
which, in turn, can easily be parsed. Additionally, the components of a FLANGE “sentence”
are strongly typed to further simplify processing and to ensure reliable transmission
across a communication interface.

One particularly useful feature of FLANGE is its two-way communication capability.
The  following  example  shows  a  typical  interaction  between  MICROARRAS’  user  interface 
program  and  its  analytic  engine.  Suppose  the  user  wishes  MICROARRAS  to  display 
concordance  information  for  a  particular  word  in  a  text  in  the  textbase.  The  user’s  request 
for  a  concordance  is  first  translated  by  the  interface  program  into  a  FLANGE  expression. 
That  expression  is  then  sent  to  the  MICROARRAS  engine,  either  running  on  the  same 
machine  or  on  a  remote  computer.  The  engine  parses  the  message  and  performs  the 
operation  requested.  It  then  encodes  the  results  in  the  conventions  of  the  return  portion 
of FLANGE and sends that message to the user interface. The user interface parses the
message, interprets the result, and either displays the requested information to the user
or  engages  the  engine  in  a  further  FLANGE  dialogue. 

It is FLANGE’s capability of providing a formal high-level text analysis language and
its capability of delivering its results in a structured and typed form - rather than as a
stream of data - that make it feasible for an expert system to work iteratively with the
textbase.


4 Textbase


The  manuscript  by  Brooks  and  Blaauw  we  are  using  for  our  current  project  consists  of 
some 188,278 words. While this textbase is small compared to large commercial databases,
it  is  large  enough  to  provide  a  realistic  demonstration  environment. 

Texts to be used as MICROARRAS textbases require initial processing. First, format
marks of interest to users must be inserted in the text. For this text, we included format
marks which will be used in the display of the retrieved text (line, tab, italics, line, label),
as well as those which provide context information (section, paragraph, sentence, item).
Second, a series of programs are run on the text to produce an inverted file. Finally, this
inverted file is converted to fixed-length records for fast access.
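
As a rough illustration of what inversion buys, the following Python sketch builds a word-to-sentence index and answers a contextual "and" query from it. It is not the MICROARRAS file format (the fixed-length record layout is not described here); the function names, the regular-expression sentence splitter, and the sample text are assumptions.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Map each lowercased word to the sorted sentence numbers containing it."""
    index = defaultdict(set)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for number, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}, sentences


def within_context(index, term_a, term_b, span=0):
    """Sentences containing term_a that have term_b within `span` sentences."""
    hits_b = index.get(term_b, [])
    return [a for a in index.get(term_a, [])
            if any(abs(a - b) <= span for b in hits_b)]


text = ("A stack is a last-in first-out structure. "
        "Push adds an item. Pop removes the top item from the stack.")
index, sentences = build_inverted_file(text)
same_sentence = within_context(index, "stack", "pop")           # span of 0
adjacent = within_context(index, "stack", "push", span=1)       # one sentence either side
```

The context marks mentioned above (section, paragraph, sentence, item) play the role that sentence numbers play in this sketch: they let a search be confined to, or widened across, units of text without rereading the text itself.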

5  Thesaurus 

All  domain-specific  information  is  contained  in  a  hierarchical  thesaurus.  In  future 
extensions,  this  thesaurus  will  apply  to  an  entire  database.  For  our  current  project,  it 
applies  only  to  the  Brooks  and  Blaauw  text. 

The  thesaurus  was  constructed  manually  from  the  8313  different  word  types  in  the 
textbase.  Removing  numbers,  punctuation,  stop  words,  proper  names,  and  words  which 
appeared  only  once  left  5726  types.  These  were  grouped  into  1993  stem  groups.  Common 
word  forms  missing  from  the  stem  groups  were  added,  bringing  the  total  to  6990  types. 
936  technical  word  stem  groups  were  selected  from  the  1993  to  be  arranged  hierarchically 
in  the  thesaurus.  Finally,  extremely  high  frequency  stem  groups  were  combined  to  form 
more  precise  compound  terms. 

Conceptually,  a  thesaurus  group  is  viewed  as  a  node  in  a  lattice  structure.  Each  node 
contains  a  name,  a  list  of  synonym  stem  groups,  the  names  of  one  or  more  parent  nodes, 
and  the  names  of  zero  or  more  children  nodes.  Parent  nodes  -  nodes  higher  in  the  thesaurus 
structure  -  represent  more  general  concepts  than  the  current  node.  Children  nodes  -  nodes 
lower  in  the  thesaurus  structure  -  represent  more  specific  terms.  For  example,  consider 
the  thesaurus  entry  for  Stack. 

Node  Name:  Stack 

Node  Wordstems:  stack,  lifo 

Parent  Node(s):  Data  Structure 

Children Node(s): Pop, Top, Push, Index Arithmetic
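
The node layout above maps naturally onto a small record type. The Python sketch below encodes only the Stack entry given in the text; the class name ThesaurusNode, the field names, and the ancestors helper are assumptions, and the actual thesaurus routines are written in C.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusNode:
    name: str
    wordstems: list                                 # synonym stem groups
    parents: list = field(default_factory=list)     # more general concepts
    children: list = field(default_factory=list)    # more specific terms


thesaurus = {
    "Stack": ThesaurusNode(
        name="Stack",
        wordstems=["stack", "lifo"],
        parents=["Data Structure"],
        children=["Pop", "Top", "Push", "Index Arithmetic"],
    ),
}


def ancestors(name, thesaurus):
    """Collect all more general node names reachable through parent links."""
    found = []
    todo = list(thesaurus[name].parents)
    while todo:
        parent = todo.pop(0)
        if parent in found:
            continue                   # a lattice can reach a node twice
        found.append(parent)
        if parent in thesaurus:        # nodes absent from this sketch end the walk
            todo.extend(thesaurus[parent].parents)
    return found


more_general = ancestors("Stack", thesaurus)
more_specific = thesaurus["Stack"].children
```

Allowing a list of parents rather than a single parent is what makes the structure a lattice: a node can be reached from more than one general concept.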

6  Expert  System 

The  expert  system  performs  two  main  functions:  it  reformulates  the  Boolean  query 
based  on  previous  search  results,  and  it  ranks  the  retrieved  passages  in  decreasing  order 
of  relevance  for  presentation  to  the  user.  To  perform  these  functions,  it  uses  a  knowledge 
base  of  search  strategies  and  text  analysis  procedures.  As  we  pointed  out  above,  all  domain 
knowledge  is  contained  in  the  thesaurus. 

6.1  Query  Formulation 

To  invoke  the  system,  the  user  forms  the  initial  Boolean  query.  The  expert  system 
then  receives  the  query,  assumes  a  default  context  of  one  sentence,  maps  the  query  into 
FLANGE  (the  MICROARRAS  two-way  control  language  mentioned  above),  and  sends  the 
FLANGE  query  to  the  MICROARRAS  engine.  The  engine  performs  the  search,  packages 
the  results  into  FLANGE,  and  sends  the  formatted  message  back  to  the  expert  system. 
The  expert  system  unpackages  the  FLANGE  message  and  decides  whether  to  display  the 
results  to  the  user  or  whether  to  reformulate  the  query. 

6.2 Query Reformulation

Following  the  initial  search,  the  decision  to  reformulate  the  query  is  based  on  estimates 
of  the  recall  (the  number  of  passages  identified  by  the  search)  and  the  precision  (the  percent 
of  retrieved  passages  that  are  relevant).  The  expert  system  makes  these  estimates  using 
frequency  statistics  for  the  query  terms  in  the  textbase  as  a  whole,  their  frequency  in 
the  retrieved  passages,  and  the  number  of  passages  retrieved.  If  the  system  decides  to 
reformulate  the  query,  it  may  do  so  by  manipulating  three  different  variables,  alone  or  in 
combination.  They  are  the  number  of  context  units  between  terms;  the  search  terms;  and 
the  Boolean  operators. 

If the recall is very low, the query will be broadened to match more passages. This may
be done by replacing individual words in the search expression by sets of words (categories,
in MICROARRAS’ terms). Initially, the expert system expands words into root groups
(words with the same stem). Next, it replaces words with synonym sets. If necessary,
related word sets will be concatenated. In each case, the sets added are derived from the
hierarchical thesaurus. Alternatively, contextual constraints can be relaxed, so that the
search extends over adjacent sentences, the whole paragraph, adjacent paragraphs, etc.
Finally, the Boolean operators may be changed from “and” to “or”, or “not”
components may be removed from the expression.
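
The staged term expansion described in this paragraph can be sketched in Python as follows. The three lookup tables and their contents are invented for the example (in the real system the sets come from the hierarchical thesaurus), and the function name broaden_term and the numeric levels are assumptions.

```python
# Hypothetical thesaurus-derived tables for a single term.
STEM_GROUPS = {"stack": ["stack", "stacks", "stacked", "stacking"]}
SYNONYMS = {"stack": ["stack", "lifo"]}
RELATED = {"stack": ["push", "pop", "top"]}


def broaden_term(word, level):
    """Return the word set for `word` at broadening level 0 (word only) to 3."""
    words = [word]
    if level >= 1:
        words += STEM_GROUPS.get(word, [])   # root group: same stem
    if level >= 2:
        words += SYNONYMS.get(word, [])      # synonym set
    if level >= 3:
        words += RELATED.get(word, [])       # related word sets
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(words))


for level in range(4):
    print(level, broaden_term("stack", level))
```

Each level contains the previous one, so moving up a level can only add matches, which is the property needed when recall is too low.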

Recall  that  is  too  high  usually  does  not  pose  a  problem  so  long  as  the  precision  remains 
high.  Since  the  retrieved  passages  will  be  displayed  in  decreasing  order  of  relevance,  the 
user  can  simply  stop  reading  the  passages  whenever  he  wishes.  The  only  time  this  is  a 
problem  is  when  the  number  is  so  large  that  it  requires  excessive  time  to  rank-order  them. 
In  those  cases,  the  expert  system  manipulates  the  three  variables  in  reverse. 

If  precision  is  too  low  -  i.e.  too  many  irrelevant  passages  are  retrieved  -  a  more 
specific  search  expression  is  required.  In  this  case,  the  expert  system  first  tries  increasing 
the  contextual  constraints.  If  this  does  not  produce  the  desired  results,  it  replaces  query 
terms  with  more  specific  terms.  The  system  derives  a  candidate  term  from  the  thesaurus 
and  then  asks  the  user  for  confirmation  before  reformulating  the  query.  A  final  strategy 
involves  changing  the  Boolean  operators. 

Precision  cannot  really  be  too  high,  since  ideally  all  relevant  passages  and  no  irrelevant 
passages  would  be  retrieved.  However,  high  precision  may  mask  another  problem.  It  may 
indicate  that  the  query  was  not  broad  enough  and  that,  in  fact,  recall  was  low.  This 
possibility  was  discussed  above. 

The  expert  system  must  decide  when  to  stop  the  reformulation  process.  This  decision 
will be based on a combination of user-supplied a priori knowledge of the amount of
information  desired  and  analysis  of  the  results  of  the  searches  to  date.  Certainly,  the 
expert  system  will  try  to  improve  recall  if  nothing  at  all  is  retrieved.  Similarly,  if  a  great 
many  passages  are  retrieved,  precision  must  be  improved.  If  the  results  of  this  query  are 
worse  than  previous  results,  the  expert  system  will  backtrack  to  an  earlier  query.  Finally, 
if  the  expert  system  runs  out  of  things  to  try,  control  is  returned  to  the  user  regardless  of 
the  amount  of  information  retrieved. 
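
One way to picture this stopping logic is the loop below. It is a simplified Python sketch, not the production rules under development: it judges results only by hit count against a user-supplied range, and the names refine, badness, broaden_steps and narrow_steps, along with the toy search table, are assumptions.

```python
def refine(query, search, broaden_steps, narrow_steps, min_hits, max_hits):
    """Reformulate until the hit count is in [min_hits, max_hits] or options run out."""

    def badness(hits):
        # Distance of the hit count from the acceptable range (0 means acceptable).
        return max(min_hits - hits, hits - max_hits, 0)

    best, best_hits = query, search(query)
    while badness(best_hits) > 0:
        steps = broaden_steps if best_hits < min_hits else narrow_steps
        improved = False
        for step in steps:
            candidate = step(best)
            if candidate is None:
                continue                      # this strategy has nothing left
            hits = search(candidate)
            if badness(hits) < badness(best_hits):
                best, best_hits = candidate, hits
                improved = True
                break
            # Otherwise the candidate is no better: backtrack to `best`.
        if not improved:
            break                             # out of things to try
    return best, best_hits


# Toy setting: the "query" is just a context span and hits are looked up in a table.
hits_by_span = {1: 0, 2: 3, 3: 12, 4: 90}
widen = lambda span: span + 1 if span < 4 else None
tighten = lambda span: span - 1 if span > 1 else None
result = refine(1, hits_by_span.__getitem__, [widen], [tighten], min_hits=5, max_hits=20)
```

The loop always terminates because a candidate is accepted only when it strictly reduces the distance from the acceptable range; when no strategy does so, control returns with the best query found so far.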

6.3  Relevance  Ranking 

The dialogue between the expert system and MICROARRAS normally produces a set
of  passages  to  be  displayed  to  the  user.  The  last  task  performed  by  the  expert  system 
is  to  rank  order  those  passages  in  terms  of  their  probable  interest  to  the  user.  To  do 
this,  it  performs  an  elementary  content  analysis  on  each  passage  and  computes  an  index 
of  probable  interest.  Factors  which  affect  this  index  value  are  the  number  of  different 
concepts  represented  in  the  passage,  the  number  of  different  word  types  for  each  concept, 
the number of tokens for each word type from the search expression appearing in the
passage,  and  the  contextual  distance  between  search  terms. 

The  passages  are  then  ranked  according  to  their  respective  index  values  and  presented 
to  the  user  in  decreasing  order  of  relevance. 
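
The text names the factors but not how they are combined, so the following Python sketch simply assumes a weighted sum plus a nearness bonus. The weights, the use of first-occurrence positions as a crude stand-in for contextual distance, and the names relevance_index and rank_passages are all assumptions made for illustration.

```python
def relevance_index(words, concepts, weights=(3.0, 2.0, 1.0, 1.0)):
    """Score a passage (list of lowercased words) against concept word sets."""
    w_concept, w_type, w_token, w_near = weights
    first_hits = []      # first matching position for each concept found
    types_found = 0      # distinct word types matched, summed over concepts
    tokens_found = 0     # total matching tokens
    for word_types in concepts.values():
        hits = [i for i, word in enumerate(words) if word in word_types]
        if hits:
            first_hits.append(hits[0])
            types_found += len({words[i] for i in hits})
            tokens_found += len(hits)
    score = (w_concept * len(first_hits)
             + w_type * types_found
             + w_token * tokens_found)
    if len(first_hits) > 1:
        # Crude distance measure: spread between first occurrences.
        spread = max(first_hits) - min(first_hits)
        score += w_near / (1 + spread)
    return score


def rank_passages(passages, concepts):
    """Sort passages in decreasing order of the relevance index."""
    return sorted(passages,
                  key=lambda p: relevance_index(p.lower().split(), concepts),
                  reverse=True)


concepts = {"Stack": {"stack", "lifo"}, "Stack Ops": {"push", "pop"}}
passages = [
    "registers hold operands",
    "the stack grows downward",
    "push the value onto the stack then pop the stack",
]
ranked = rank_passages(passages, concepts)
```

With these assumed weights, a passage that touches more concepts always outranks one that repeats a single term, since the concept count carries the largest weight.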

7  Experiments 

Gerard Salton [Salton, 1983] describes two measures of performance: system effectiveness
and efficiency. Basically, the effectiveness of an information system is a measure of the
system  performance  whereas  efficiency  is  a  measure  of  the  amount  of  user  effort  required 
to  perform  a  task.  Once  the  system  is  built,  we  will  run  controlled  experiments  to  test 
whether  the  expert  system  can  improve  a  novice  searcher’s  effectiveness  and  efficiency. 

We will use Computer Science graduate students as the subjects since they are proficient
computer users but novice searchers. The subjects will be asked to perform several
retrieval  tasks  differing  in  the  amount  of  information  to  be  retrieved  and  the  difficulty  of 
the  searches.  Each  subject  will  perform  searches  with  and  without  the  expert  system  front 
end,  and  data  will  be  collected  to  evaluate  their  performance.  The  subjects  will  also  be 
asked  for  relevance  feedback  on  the  passages  retrieved.  Effectiveness  will  be  measured  by 
precision  and  recall,  and  efficiency  will  be  measured  by  the  time  necessary  to  perform  each 
search.  The  system  effectiveness  and  user  efficiency,  with  and  without  the  expert  system, 
will be compared to evaluate the impact of the online search assistant.
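
For the effectiveness measures, a minimal Python sketch is given below. It assumes the standard set-based definitions (precision as the fraction of retrieved passages that are relevant, recall as the fraction of relevant passages that are retrieved) and relevance judgments supplied by the subjects; the function name and the passage identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of passage identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four passages retrieved; three judged relevant, two of which were retrieved.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

Computing both values for each task, once with and once without the expert system front end, gives the paired comparison the experiment calls for.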

8  Conclusion 

8.1  Current  Status 

The  text  retrieval  software,  textbase,  and  thesaurus  are  complete;  and  the  high  level 
strategies for the expert system have been designed. We are currently writing the production
rules to be used by the expert system. We expect to have a working prototype by
early  1988  and  to  run  the  experiment  described  above  during  the  spring. 

8.2  Future  Work 

We  view  our  current  project  as  a  beginning,  rather  than  an  end  in  itself.  As  mentioned 
above, it is intended to demonstrate the concept of using an expert system as an intermediary
function between a user interface and an analytic engine. In the future, we will
extend  the  search  and  analysis  operations  that  are  leveraged  by  the  expert  system.  These 
include  a  broader  range  of  retrieval  algorithms  and  more  sophisticated  content  analysis  to 
determine  probable  relevance.  We  will  also  explore  computing  and  interpreting  a  variety 
of  statistical  and  stylistic  measures,  and  we  plan  to  develop  an  informal  graphical  query 
language  in  which  to  specify  the  initial  search  request. 